A new screening tool for SARS-CoV-2 infection based on self-reported patient clinical characteristics: the COV19-ID score

Background While several studies aimed to identify risk factors for severe COVID-19 cases to better anticipate intensive care unit admissions, very few have been conducted on self-reported patient symptoms and characteristics predictive of RT-PCR test positivity. We therefore aimed to identify those predictive factors and construct a predictive score for the screening of patients at admission. Methods This was a monocentric retrospective analysis of clinical data from 9081 patients tested for SARS-CoV-2 infection from August 1 to November 30, 2020. A multivariable logistic regression using least absolute shrinkage and selection operator (LASSO) was performed on a training dataset (60% of the data) to determine associations between self-reported patient characteristics and COVID-19 diagnosis. Regression coefficients were used to construct the Coronavirus 2019 Identification score (COV19-ID) and the optimal threshold was calculated on the validation dataset (20%). Its predictive performance was finally evaluated on a test dataset (20%). Results A total of 2084 (22.9%) patients tested positive for SARS-CoV-2 infection. Using the LASSO model, COVID-19 was independently associated with loss of smell (Odds Ratio, 6.4), fever (OR, 2.7), history of contact with an infected person (OR, 1.7), loss of taste (OR, 1.5), muscle stiffness (OR, 1.5), cough (OR, 1.5), back pain (OR, 1.4), loss of appetite (OR, 1.3), as well as male sex (OR, 1.05). Conversely, COVID-19 was less likely in patients reporting smoking (OR, 0.5), sore throat (OR, 0.9) and ear pain (OR, 0.9). All aforementioned variables were included in the COV19-ID score, which demonstrated on the test dataset an area under the receiver-operating characteristic curve of 82.9% (95% CI 80.6%–84.9%), and an accuracy of 74.2% (95% CI 74.1%–74.3%) with a high sensitivity (80.4%, 95% CI [80.3%–80.6%]) and specificity (72.2%, 95% CI [72.2%–72.4%]). 
Conclusions The COV19-ID score could be useful in early triage of patients needing RT-PCR testing thus alleviating the burden on laboratories, emergency rooms, and wards. Supplementary Information The online version contains supplementary material available at 10.1186/s12879-022-07164-1.

Diaz Badial et al. BMC Infectious Diseases (2022) 22:187

hard to differentiate from a broad range of respiratory tract infections [1,2]. Diagnostic testing using real-time reverse transcription polymerase chain reaction (RT-PCR) has therefore been used to identify infected patients [3,4]. Several studies aimed to identify risk factors for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection severity in order to anticipate intensive care unit (ICU) admissions.
However, few studies have been conducted on self-reported patient symptoms and characteristics predictive of RT-PCR test positivity [4,26-29]. Models involving loss of smell, loss of taste, cough and fever have been shown to reveal a higher infection likelihood [30,31]. Although these predictive models were built on large cohorts, the proportion of infected patients was largely overestimated and symptoms may not have been collected with precision at the time of RT-PCR testing.
The main purpose of the present study was to identify predictive factors for SARS-CoV-2 infection based on self-reported patient symptoms and medical conditions, and to construct a predictive score for patient screening at admission. Given the limited availability of RT-PCR testing and the delay in results, a reliable and quick tool may help front-line clinicians prioritize the screening of patients at high risk for SARS-CoV-2 infection.

Study design and participants
A retrospective analysis of clinical data from 11,087 consecutive patients tested for SARS-CoV-2 infection was undertaken at the La Tour Hospital's emergency center in Geneva (Switzerland) between the 1st of August and the 30th of November 2020. Our emergency department is an academically affiliated teaching center, requisitioned for SARS-CoV-2 testing by the city's health authorities. It is the second-largest emergency department in the city, accounting for 29,000 visits per year. All RT-PCR tests performed on patients younger than 18 years of age (n = 881, 7.9%) were excluded (Fig. 1). Since RT-PCR tests are associated with a variable false-negative rate [32], we excluded all non-final results from patients tested several times in our hospital due to worsening symptoms (n = 530, 4.8%). All incomplete forms were also excluded (n = 595, 5.4%). This left 9081 patients with a unique final RT-PCR result for further analyses, comprising 6871 symptomatic (75.7%) and 2210 asymptomatic (24.3%) cases. Asymptomatic patients were tested for travelling purposes (n = 834, 9.2%), before surgery (n = 526, 5.8%), following close contact with infected people (n = 479, 5.3%) or for other reasons (n = 371, 4.1%). This study was approved by the ethics committee of Geneva (CCER 2020-01742) and the need for informed written consent was waived owing to the urgent situation and the retrospective use of anonymized data.

RT-PCR tests
SARS-CoV-2 infection was confirmed by positive RT-PCR tests on nasopharyngeal swab specimens. Specimens were sent to and analyzed by the National Reference Center for Emerging Viral Infections (CRIVE) at the Geneva University Hospitals (HUG). PCR assays were performed using Roche's cobas® 6800 SARS-CoV-2 analyzer (Roche Molecular Systems, Branchburg, NJ), which received CE certification and Emergency Use Authorization (EUA) from the U.S. Food and Drug Administration (FDA).

Study variables
Each enrolled patient filled in a case report form (CRF) at the time of screening. The study variables included demographic data (age, gender, weight, height, profession) and a series of specific symptoms including cough, breathing difficulties, runny nose, sore throat, ear pain, headache, fever, muscle stiffness, back pain, diarrhea, nausea/vomiting, loss of appetite, loss of weight, loss of smell, loss of taste, dizziness, respiratory allergies and unusual fatigue. Other potential risk factors recorded included immunosuppression, diabetes, tobacco use, chronic pulmonary and heart disease, cancer, as well as any history of close contact with people who had tested positive for SARS-CoV-2 infection. The data were then imported into a digital database, coded for anonymization, and stored on a secured hospital server.

Statistical analyses
For baseline characteristics, continuous variables were reported as mean ± standard deviation with median and interquartile range (IQR), while categorical variables were reported as proportions. For non-Gaussian continuous data, differences between groups were evaluated using Wilcoxon rank-sum tests (Mann-Whitney U test), while for Gaussian continuous data, differences between groups were evaluated using unpaired Student t-tests. For categorical data, differences between groups were evaluated using the Fisher exact test. Univariable and multivariable logistic regressions were performed to determine associations between self-reported patient characteristics and COVID-19 diagnosis. We did not use imputation methods and performed our analyses on complete data only; the presented screening tool can therefore only be used when information about all patient symptoms and characteristics is known. The study population was randomly split into a training dataset (60%) used to build the multivariable logistic model, a validation dataset (20%) used to validate it, and a test dataset (20%) used to test it. The variables included in the shortened multivariable regression model were identified using the least absolute shrinkage and selection operator (LASSO) method. The regularization parameter used in this method was determined using tenfold cross-validation, and set at one standard error from the λ that minimizes classification error (λ.1se). Collinearity was assessed using the Variance Inflation Factor (VIF) for each covariate, and was deemed acceptable if the maximum VIF did not exceed 2.0. Odds ratios (OR) and the 95% CI were calculated for each independent variable. The probability of being infected by SARS-CoV-2 was calculated as follows:

$$P(\text{infection}) = \frac{1}{1 + e^{-\left(\text{Intercept} + \sum_{i} \beta_i X_i\right)}} \qquad (1)$$

with "Intercept" being the regression model intercept, and β_i the regression coefficient related to the independent variable X_i (X_i = 0 or 1).
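The probability model described above can be sketched in a few lines of Python (the study itself used R). This is a minimal illustration under stated assumptions, not the authors' code: the coefficients are derived as β = ln(OR) from three of the rounded odds ratios reported in the abstract, and the intercept is a hypothetical placeholder because the fitted value is not given in the text.

```python
import math

# Coefficients derived from rounded odds ratios in the abstract: beta = ln(OR).
# Only three of the twelve predictors are shown, for brevity.
BETAS = {
    "loss_of_smell": math.log(6.4),
    "fever": math.log(2.7),
    "smoking": math.log(0.5),
}

# HYPOTHETICAL intercept: the fitted value is not reported in the text.
INTERCEPT = -2.0


def infection_probability(intercept, betas, x):
    """Logistic model: P = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    linear = intercept + sum(beta * x.get(name, 0) for name, beta in betas.items())
    return 1.0 / (1.0 + math.exp(-linear))


# Each X_i is 0 (absent) or 1 (present), as in the text.
non_smoker = {"loss_of_smell": 1, "fever": 1, "smoking": 0}
smoker = {"loss_of_smell": 1, "fever": 1, "smoking": 1}

p_non_smoker = infection_probability(INTERCEPT, BETAS, non_smoker)
p_smoker = infection_probability(INTERCEPT, BETAS, smoker)
print(f"non-smoker: {p_non_smoker:.2f}, smoker: {p_smoker:.2f}")
```

Because the intercept is a placeholder, the printed probabilities are not clinically meaningful; the sketch only shows how binary predictors with positive or negative coefficients move the predicted probability.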
The regression coefficient for each independent variable selected in the multivariable model was multiplied by ten, rounded up to the nearest integer value, and used to build a predictive score: the Coronavirus 2019 Identification (COV19-ID) score. The regression coefficients were thereafter adjusted proportionally to set the maximum of the score at 100. The receiver operating characteristic (ROC) curve was constructed and its area under the curve (AUC) evaluated. The optimal cutoff value was then calculated on the validation dataset to discriminate between infected and non-infected patients with the highest sensitivity and specificity (Youden Index). Two other thresholds were additionally described in Additional file 1 to maximize either the sensitivity or the specificity of the COV19-ID score. To validate the variable selection in the LASSO regression, the AUCs obtained by the COV19-ID score and the entire multivariable model were evaluated and compared using a paired DeLong test. The sensitivity, specificity, positive and negative predictive values (PPV and NPV), positive and negative likelihood ratio (LR+ and LR−), F1 score and the Matthews correlation coefficient (MCC) were calculated on the test dataset based on the number of true positive (TP), false negative (FN), false positive (FP) and true negative (TN) cases. A bootstrap method with 1000 random resamples of the test dataset was used to calculate the 95% confidence interval (95% CI) of all aforementioned parameters. Statistical analyses were performed using R version 3.6.2 (R Foundation for Statistical Computing, Vienna, Austria). P-values < 0.05 were considered statistically significant.
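The score construction and the Youden-index cutoff described above could look roughly like the following Python sketch. Several points are assumptions rather than the authors' method: coefficients are approximated as ln(OR) from the rounded odds ratios in the abstract, "maximum of the score" is read as the total obtained when every risk factor and no protective factor is present, Python's built-in `round` stands in for the rounding step, and the small validation set is a hypothetical toy example. The resulting points are illustrative and are not expected to match the published score.

```python
import math

# Rounded odds ratios from the abstract; the published score was built
# from the unrounded regression coefficients.
ODDS_RATIOS = {
    "loss_of_smell": 6.4, "fever": 2.7, "contact_with_infected": 1.7,
    "loss_of_taste": 1.5, "muscle_stiffness": 1.5, "cough": 1.5,
    "back_pain": 1.4, "loss_of_appetite": 1.3, "male_sex": 1.05,
    "smoking": 0.5, "sore_throat": 0.9, "ear_pain": 0.9,
}


def build_score(odds_ratios):
    """Points = round(10 * ln(OR)), rescaled so that the largest attainable
    total (all risk factors present, no protective factor) equals 100."""
    raw = {name: round(10 * math.log(o)) for name, o in odds_ratios.items()}
    max_total = sum(v for v in raw.values() if v > 0)
    return {name: v * 100 / max_total for name, v in raw.items()}


def youden_cutoff(scores, infected):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1,
    where a patient is predicted positive when score >= cutoff."""
    n_pos = sum(infected)
    n_neg = len(infected) - n_pos
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, infected) if y and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, infected) if not y and s < cutoff)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


points = build_score(ODDS_RATIOS)

# HYPOTHETICAL toy validation set (scores and RT-PCR results, 1 = positive).
toy_scores = [0, 5, 10, 12, 20, 30, 45, 60]
toy_infected = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = youden_cutoff(toy_scores, toy_infected)
print(f"cutoff >= {cutoff}, Youden J = {j:.2f}")
```

Scanning only the observed score values is sufficient here, since sensitivity and specificity can change only at those values; ties are resolved in favor of the lowest cutoff.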

Predictive factors of SARS-CoV-2 infection
Among the tested population included in this study, 2084 patients (22.9%) were diagnosed with SARS-CoV-2 infection. Compared to patients with negative test  Table 2).

Creation and validation of the COV19-ID score
Only twelve of the aforementioned predictors of SARS-CoV-2 infection were selected by the LASSO regression and used to create the COV19-ID score (Figs. 2 and 3). The  (Fig. 4).
The COV19-ID score accuracy was 72.4% when maximizing both the sensitivity and specificity (cutoff value of ≥ 14 points). The sensitivity and specificity were 75.4% and 71.5% respectively, with a PPV of 43.5% and an NPV of 90.9% (Table 3). The F1 score and MCC were 0.55 and 0.40 respectively. Two other COV19-ID score thresholds were calculated to maximize either the sensitivity (≥ 8.5 points) or the specificity (≥ 25 points).
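As a rough cross-check of how these metrics relate to one another, the following Python sketch recomputes accuracy, PPV, NPV, likelihood ratios, F1 score and MCC from the sensitivity and specificity quoted above. It rests on one assumption that is not stated in the text: that the prevalence in the dataset behind these figures equals the cohort-wide 22.9%. The confusion-matrix cells are therefore expressed as proportions rather than patient counts.

```python
import math


def diagnostic_metrics(tp, fn, fp, tn):
    """Standard diagnostic metrics from confusion-matrix counts or proportions."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "ppv": ppv,
        "npv": npv,
        "lr_pos": sens / (1 - spec),
        "lr_neg": (1 - sens) / spec,
        "f1": 2 * ppv * sens / (ppv + sens),
        "mcc": (tp * tn - fp * fn)
        / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Cell proportions rebuilt from the reported sensitivity (75.4%) and
# specificity (71.5%). ASSUMPTION: prevalence equals the cohort-wide 22.9%.
prevalence, sens, spec = 0.229, 0.754, 0.715
tp = prevalence * sens
fn = prevalence * (1 - sens)
tn = (1 - prevalence) * spec
fp = (1 - prevalence) * (1 - spec)

metrics = diagnostic_metrics(tp, fn, fp, tn)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under this assumption the recomputed accuracy, PPV, NPV, F1 score and MCC fall within about one percentage point of the reported values; small residual differences are expected because the actual prevalence of each data split may differ from 22.9%.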

Discussion
The rapid spread of the COVID-19 pandemic and the need for mass testing invariably overwhelm laboratory capabilities, resulting in longer result delays. To date, proposed screening tools [33,34] mainly concern the detection of severe cases in order to anticipate ICU admissions. However, screening for SARS-CoV-2 infection at admission may help discriminate between highly suspected patients needing quarantine measures or admission to COVID-19 dedicated units and those who could safely be discharged [35] while test results are pending. Our study presented and validated a new clinical tool (COV19-ID score) for SARS-CoV-2 infection based on the patient's self-reported symptoms and medical history.
With an AUC of 83%, a sensitivity of 80% and a specificity of 72% for the prediction of SARS-CoV-2 infection in our test dataset, our screening tool compares well with the model of Menni et al. [31], who reported an AUC of 76% (sensitivity, 65%; specificity, 78%) in a United States cohort with a comparable proportion of infected patients (26% vs 24% in our test dataset). In their study, Zavascki et al. [36] created a score which included only 5 variables (patient age ≥ 60 years old, fever, dyspnea, coryza, and fatigue) and demonstrated an AUC of 88% in their validation dataset. It is worth noting, however, that they did not use an external database for the validation process and that important symptoms such as loss of taste and loss of smell were not reported or incorporated in their model. In our study, 47% of the patients predicted to be infected by SARS-CoV-2 truly had a positive test. This PPV is lower than that reported by Menni et al. [31] (69%). On the other hand, 92% of the patients predicted as not being infected had a negative test, which is higher than that reported in the above study [31] (75%). These comparisons, however, should be interpreted with caution considering the differences in the studied population (e.g. the proportion of infected cases) and the cut-off value chosen for the prediction.
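The caution about differing proportions of infected cases can be made concrete with Bayes' theorem: for a fixed sensitivity and specificity, PPV and NPV shift with prevalence. The Python sketch below uses the test-dataset sensitivity and specificity from the abstract (80.4% and 72.2%); the 24% prevalence is the value quoted above for the test dataset, while the other prevalence values are hypothetical settings chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) computed with Bayes' theorem."""
    tp = sensitivity * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    fn = (1 - sensitivity) * prevalence
    return tp / (tp + fp), tn / (tn + fn)


# Test-dataset sensitivity and specificity as reported in the abstract.
SENS, SPEC = 0.804, 0.722

# 0.24 is the test-dataset prevalence; 0.05, 0.10 and 0.50 are HYPOTHETICAL.
for prevalence in (0.05, 0.10, 0.24, 0.50):
    ppv, npv = predictive_values(SENS, SPEC, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 24% prevalence the sketch returns values close to the 47% PPV and 92% NPV discussed above, and it shows that PPV falls and NPV rises as prevalence decreases, which is why predictive values from cohorts with different infection rates are not directly comparable.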
The main symptoms reported by COVID-19 patients included loss of smell, loss of taste, fever, muscle stiffness, back pain and loss of appetite. Although exposure to a contagious person is a known risk factor for transmission of the disease, it was found in less than half of infected patients. This emphasizes the role of asymptomatic viral transmission in the population and the need for enhanced compliance with barrier measures. Although breathing difficulties have been widely described as one of the most prevalent symptoms associated with COVID-19 [37], our study revealed that, in the absence of confounding factors, this symptom was rather suggestive of a non-SARS-CoV-2 infection. This finding contradicts that of Romero-Gameros et al. [38] but corroborates several recent studies that described a possible association between SARS-CoV-2 infection and lack of dyspnea (silent hypoxia) due to neurological damage [39,40]. Another explanation would be that patients who did not present clinical signs suggestive of COVID-19 reported dyspnea because of another type of pneumonia or simply stress/anxiety before RT-PCR testing. Patients with sore throat and/or ear pain were also less likely to be infected by SARS-CoV-2, suggesting that these symptoms are more specific to other ear, nose and throat (ENT) diseases. Likewise, Spechbach et al. reported breathing difficulties and sore throat as predictors of a negative RT-PCR test [41]. Recent studies indicated that smokers tended to be less infected [4,30,42]. Our results corroborate these findings, given that the odds of SARS-CoV-2 infection were halved in smokers (OR, 0.5). Twelve variables were selected for the construction of the COV19-ID score owing to their high independent explanatory effect on SARS-CoV-2 infection. 
Among them, nine were risk factors for infection: male sex, cough, loss of smell, loss of taste, fever, muscle stiffness, back pain, loss of appetite, and history of close contact with infected people. Our results are very similar to those published by Spechbach et al., who found that anosmia, fever, muscle pain, and cough were strong COVID-19 predictors [41]. In Menni et al.'s prediction model [31], loss of smell and taste, severe or persistent cough, as well as loss of appetite were also highly predictive. Likewise, Apra et al. [30] reported that anosmic or ageusic patients were more likely to be infected but suggested prioritizing RT-PCR tests in patients with cough. Mao et al. [27] also found that exposure history was an independent risk factor for SARS-CoV-2 infection. Fever is usually one of the most reported symptoms in COVID-19 patients [37,43]. In some studies, notably those performed in fever clinics [27], this symptom is so frequently reported (> 80%) in the overall tested population that it does not help in the identification of COVID-19 patients. However, in a context of massive testing in a standard hospital, we showed that fever was reported by less than 20% of the symptomatic population. Our analyses revealed it to be the second most important factor associated with SARS-CoV-2 infection (behind loss of smell) at patient admission. It is worth noting that among all the aforementioned clinical signs, non-flu-like symptoms such as loss of smell or loss of taste are often considered in the screening process for SARS-CoV-2 infection owing to their greater specificity [44-46].
Although the variables excluded from our model were not predictive factors of COVID-19, they could be of great interest in the prediction of infection severity and should still be considered during the medical encounter. For instance, the association between diabetes and the severity/mortality of patients with COVID-19 is well documented [47,48], although this medical condition is not a risk factor per se for SARS-CoV-2 infection. Similarly, factors identified as protective against SARS-CoV-2 infection might be risk factors for COVID-19 severity. In our study, smoking appeared as a protective factor for RT-PCR positivity, but it nonetheless contributes to COVID-19 severity once the patient is infected [49-52].
As to the use of this model in clinical practice, we suggest keeping the patients blinded to the score at the time of symptom screening. Otherwise, patients might be tempted to selectively report symptoms according to whether or not they are strongly related to SARS-CoV-2 infection, thereby reducing the diagnostic performance of the COV19-ID score. Furthermore, patients are unfamiliar with medical jargon, and the medical lexicon used to describe the symptoms needs to be adapted to the population's understanding for appropriate data collection (e.g. anosmia = loss of smell; ageusia = loss of taste, etc.). The strength of this score is its use at the time of admission. Based solely on patient self-reported information, it requires no health personnel assistance. Compared to models using laboratory and/or imaging data [53], this score is rapidly obtainable and does not require ancillary testing and/or patient radiation. The COV19-ID score has many potential clinical uses in a strained environment. Patients can be screened at admission and, according to their score, directed to waiting areas planned for patients at low or high risk for SARS-CoV-2 infection, thus preventing cross contamination [54]. Physicians or senior nurses can be assigned to high-risk areas, thus optimizing resources. RT-PCR tests for patients at high risk could be prioritized to reduce result delays and the burden on laboratory facilities. Patients for whom a first test is negative but who have a high COV19-ID score can be scheduled for a second test to decrease false negative results. For the same purpose, RT-PCR tests (gold standard) could also be used instead of rapid antigen tests when patients present a COV19-ID score above a certain threshold (e.g. ≥ 25 points). Finally, a discriminating tool such as the COV19-ID score has the potential to be incorporated in decision-making algorithms used in telemedicine diagnostic strategies.
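The triage uses suggested above can be sketched as a small decision helper in Python. This is a hypothetical illustration only: the two cutoffs are taken from the Results (≥ 14 points for the Youden-index threshold, ≥ 25 points for the specificity-oriented threshold), but the function name, the returned labels and the mapping of cutoffs to actions are assumptions, not a validated clinical protocol.

```python
# Cutoffs reported in the Results section.
YOUDEN_CUTOFF = 14      # balances sensitivity and specificity
SPECIFIC_CUTOFF = 25    # specificity-oriented threshold


def triage(score):
    """HYPOTHETICAL triage rule based on a patient's COV19-ID score.

    Returns the suggested waiting area and whether an RT-PCR test should
    be preferred over a rapid antigen test.
    """
    area = "high-risk" if score >= YOUDEN_CUTOFF else "low-risk"
    return {"area": area, "prefer_rt_pcr": score >= SPECIFIC_CUTOFF}


for example_score in (5, 14, 30):
    print(example_score, triage(example_score))
```

Keeping the cutoffs as named constants makes it straightforward to substitute thresholds recalibrated for another population or season.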

Limitations
This retrospective study has several limitations. First, the number of patients with confirmed SARS-CoV-2 infection may be underestimated, notably because of the suboptimal sensitivity of RT-PCR tests. To date, the RT-PCR test remains the gold standard for SARS-CoV-2 detection, although specimen sampling has been refined and test turnaround times shortened. Although non-final results from patients retested because of worsening symptoms were excluded from the database, a number of patients with false negative results could still remain in the datasets, thereby weakening the analyses. Second, a non-negligible proportion of forms was incomplete and excluded from our database (5%). However, the proportion of infected patients among the excluded forms was comparable to that of the studied dataset (21.5% vs 22.9%), so this exclusion should not represent an important bias. Our sample size may be criticized compared to multicentric or nationwide studies; however, we built our analysis on real data, gathered at the time of specimen collection, without using imputation methods for missing values. Third, the COV19-ID score was constructed from a local and homogeneous population and therefore needs to be validated prospectively in other populations. Furthermore, due to the retrospective nature of our study, we could not evaluate the diagnostic performance of the COV19-ID score on new SARS-CoV-2 variants (that may present non-classical symptoms) or on a vaccinated population. Fourth, since the statistical model used in this study did not include all patient symptoms and clinical characteristics, confounding effects that are unaccounted for could still be at play. Although we did not observe a relevant difference in terms of time since symptom onset between infected and non-infected patients, such a factor should be further analyzed to reduce false negative results. 
Fifth, the COV19-ID score was established on data collected between August and November, when SARS-CoV-2 was the predominant circulating virus. Because seasonality has a considerable impact on the onset of viral diseases other than COVID-19, late spring, early summer and winter viruses such as the influenza virus may trigger flu-like symptoms, thus weakening the diagnostic performance of the COV19-ID score and increasing false positive rates (lower specificity). Further studies are therefore needed to estimate the impact of seasonality on the use of the COV19-ID score. Finally, the use of the COV19-ID score in a context of massive testing may be associated with a higher false negative rate at the time of RT-PCR testing (lower sensitivity) due to a higher proportion of infected patients who may not yet present the majority of COVID-19 predictive factors.

Conclusions
This study presented and validated a new screening tool (the COV19-ID score) for SARS-CoV-2 infection detection based on patients' self-reported symptoms and medical history. This score has an acceptable diagnostic performance and might be useful in early triage of patients needing RT-PCR testing, thus hopefully alleviating the burden on laboratories, emergency rooms, and wards.